Opportunistically scheduling and adjusting time slices

ABSTRACT

Computerized methods, computer systems, and computer-readable media for governing how virtual processors are scheduled to particular logical processors are provided. A scheduler is employed to balance a load imposed by virtual machines, each having a plurality of virtual processors, across various logical processors (comprising a physical machine) that are running threads in parallel. The threads are issued by the virtual processors and often cause spin waits that inefficiently consume capacity of the logical processors that are executing the threads. Upon detecting a spin-wait state of the logical processor(s), the scheduler will opportunistically grant time-slice extensions to virtual processors that are running a critical section of code, thereby mitigating performance loss on the front end. Also, the scheduler will mitigate performance loss on the back end by opportunistically de-scheduling and then rescheduling a virtual machine in a spin-wait state to render the logical processor(s) available for other work in the interim.

BACKGROUND

Large-scale networked systems are commonplace platforms employed in a variety of settings for running applications and maintaining data for business and operational functions. For instance, a data center (e.g., physical cloud computing infrastructure) may provide a variety of services (e.g., web applications, email services, search engine services, etc.) for a plurality of customers simultaneously. These large-scale networked systems typically include a large number of resources distributed throughout the data center, in which each resource resembles physical machines or virtual machines running as guests on a physical host.

When the data center hosts multiple guests (e.g., virtual machines), these resources are scheduled to logical processors within the physical machines of a data center for varying durations of time. Often, mechanisms are utilized by operating system kernels to carry out the scheduling, as well as to synchronize data structures (e.g., logical processors) within the physical machines. These mechanisms typically employ the technique of spin waiting, which allows a logical processor that is scheduled to a virtual machine to spend time waiting for an event to occur without being rescheduled to another virtual machine. Generally, spin waits are consistently used in multithreaded environments that consider the costs associated with rescheduling a virtual machine much greater than the inefficiencies of interrupting a spin wait.

The multithreaded environments also rely on these mechanisms to schedule threads issued by multiple virtual processors (comprising the virtual machines) to be executed on multiple logical processors simultaneously. However, spin waits that are presently occurring on one or more of the multiple logical processors block the threads from being scheduled by others of the multiple virtual processors. These blocked logical processors create inefficiencies within the multithreaded environment. Accordingly, the general policy of allowing spin waits to achieve completion results in under-utilization of physical machines within a data center and significant throughput reductions with respect to the virtual machines.

SUMMARY

This Summary is provided to introduce concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

Embodiments of the present invention provide mechanisms that operate within a multithreaded environment and that opportunistically allow a spin wait to occur on a logical processor for a predefined period of time, or to de-schedule a virtual processor from the logical processor performing the spin wait for a predefined period of time before rescheduling the virtual processor to finish a particular task. In one embodiment, these mechanisms, such as the scheduler, are configured to receive indications that depict a state of a spin wait that is in progress and to act based on the spin-wait state. For instance, if the spin wait occurs while the logical processor is executing a critical section of code, a time-slice extension that has a reduced duration of time may be granted to the virtual processor scheduled to the logical processor. Accordingly, completion of the ongoing spin wait is accelerated while a load imposed by a plurality of virtual processors is balanced across resources of a physical machine.

In another embodiment, the scheduler is configured to reduce the occurrence of long spin waits by de-scheduling a virtual processor, which is presently performing a spin wait, that has acquired a lock on a logical processor. However, upon waiting a predetermined time frame after de-scheduling the virtual processor, the virtual processor may be rescheduled to the logical processor to resolve the spin wait and, potentially, to successfully acquire the lock on the logical processor. Accordingly, the procedure of de-scheduling the virtual processor allows another virtual processor to perform work on the now available logical processor. Further, the procedure of rescheduling after the predetermined time frame facilitates achieving timely execution of a thread, issued by the virtual processor, at the logical processor. By completing execution of the thread in this way, other logical processors, which have been allocated to the same virtual processor, may commence or continue executing their respective threads with minimal delay.

In yet another embodiment, the scheduler is configured to reduce the inefficiencies associated with scheduling a virtual processor to a remote logical processor, which is removed from memory utilized by the virtual processor. Generally, in the context of a non-uniform memory access (NUMA) topology, executing a thread issued by the virtual processor at a remote logical processor is inefficient because, during execution, the remote logical processor frequently accesses local memory of the virtual processor that resides in a removed location. However, the scheduler can be enlightened to recognize that the remote logical processor was scheduled and can be designed to allocate a reduced time slice on the remote logical processor. In an exemplary embodiment, the reduced time slice has a duration of time associated therewith that is less than a duration of time associated with a pre-established time slice that is generally allocated on a local logical processor.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the present invention are described in detail below with reference to the attached drawing figures, wherein:

FIG. 1 is a block diagram of an exemplary computing environment suitable for use in implementing embodiments of the present invention;

FIG. 2 is a block diagram illustrating an exemplary cloud computing platform, suitable for use in implementing embodiments of the present invention, that is configured to allocate virtual machines within a data center;

FIG. 3 is a block diagram of an exemplary distributed multithread environment illustrating virtual machines overlaid on physical machines via a scheduler, in accordance with an embodiment of the present invention;

FIG. 4 is a block diagram of an exemplary distributed multithread environment where virtual processors are interacting with a physical machine via the scheduler, in accordance with an embodiment of the present invention;

FIGS. 5-7 are schematic depictions of schemes for scheduling virtual processors to physical processors upon the virtual processors acquiring a lock thereto, in accordance with embodiments of the present invention;

FIG. 8 is a flow diagram showing a front-end method for prolonging allocation of a logical processor to a virtual processor, in accordance with an embodiment of the present invention; and

FIG. 9 is a flow diagram showing a back-end method for de-scheduling a first virtual processor from a logical processor upon acquiring a lock thereto, in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION

The subject matter of embodiments of the present invention is described with specificity herein to meet statutory requirements. However, the description itself is not intended to limit the scope of this patent. Rather, the inventors have contemplated that the claimed subject matter might also be embodied in other ways, to include different steps or combinations of steps similar to the ones described in this document, in conjunction with other present or future technologies. Moreover, although the terms “step” and/or “block” may be used herein to connote different elements of methods employed, the terms should not be interpreted as implying any particular order among or between various steps herein disclosed unless and except when the order of individual steps is explicitly described.

Embodiments of the present invention relate to methods, computer systems, and computer-readable media for dynamically scheduling virtual processors to logical processors, based on a present state of the logical processors, in order to implement front-end and back-end mitigation of inefficiencies caused by spin waits. In one aspect, embodiments of the present invention relate to one or more computer-readable media having computer-executable instructions embodied thereon that, when executed, perform a front-end method for prolonging allocation of a logical processor to a first virtual processor. Initially, the method includes the step of detecting an expiration of an initial time slice awarded to the first virtual processor that has acquired a lock on the logical processor. Typically, the initial time slice expires after the logical processor executes a thread, issued from the first virtual processor, for a predetermined duration of time.

A determination of whether the first virtual processor is executing a critical section of code associated with the thread is performed. When the determination indicates that the first virtual processor is executing the critical section of code, the method may involve granting the first virtual processor a first time-slice extension. Generally, the first time-slice extension allocates the logical processor for executing the thread for a reduced duration of time. In an exemplary embodiment, the first time-slice extension is shorter in duration than the initial time slice. The method may further include periodically inspecting the logical processor to ascertain whether the critical section of code is still being executed, and, if so, granting additional time-slice extensions.
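The front-end method described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the class `VirtualProcessor`, the function `on_slice_expired`, the microsecond values, and the cap on the number of extensions are all hypothetical choices made for illustration. The text specifies only that each extension is shorter than the initial time slice and that extensions may be granted repeatedly while the critical section is still executing.

```python
from dataclasses import dataclass

# Hypothetical durations; the text only requires EXTENSION_US < INITIAL_SLICE_US.
INITIAL_SLICE_US = 10_000
EXTENSION_US = 1_000
# Assumed safety cap on extensions; the cap is not stated in the text.
MAX_EXTENSIONS = 3


@dataclass
class VirtualProcessor:
    name: str
    critical_us_remaining: int  # time the thread still needs inside its critical section

    def in_critical_section(self) -> bool:
        return self.critical_us_remaining > 0

    def run(self, us: int) -> None:
        # Model execution by consuming time from the remaining critical section.
        self.critical_us_remaining = max(0, self.critical_us_remaining - us)


def on_slice_expired(vp: VirtualProcessor,
                     extension_us: int = EXTENSION_US,
                     max_extensions: int = MAX_EXTENSIONS) -> int:
    """Called when the initial time slice expires; returns the number of extensions granted."""
    granted = 0
    # Re-inspect after every extension and stop once the critical section is done.
    while vp.in_critical_section() and granted < max_extensions:
        vp.run(extension_us)
        granted += 1
    return granted
```

In this sketch, a virtual processor that is outside a critical section when its slice expires receives no extension and can be de-scheduled immediately, while one that needs 2,500 microseconds more receives three short extensions rather than a full new slice.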

In another aspect, embodiments of the present invention relate to a computer system for reducing runtime of a thread being executed at a node that is remotely located from memory utilized by a virtual processor. Initially, the computer system includes a first node residing on a physical machine, a second node residing on the physical machine, and a scheduler running on the physical machine. In one instance, the second node is remotely located from the memory associated with the first node. In operation, the scheduler is configured to receive an indication that the virtual processor is attempting to execute a thread and to ascertain that one or more logical processors in the first node are blocking the thread. Typically, the memory that is local to the virtual processor is included in the first node. In an exemplary embodiment, the scheduler is configured to schedule a reduced time slice on a logical processor in the second node selected to execute the thread. In this embodiment, a duration of time associated with the reduced time slice is less than a duration of time associated with a pre-established time slice generally allocated on the logical processors in the first node.

In yet another aspect, embodiments of the present invention relate to a computerized method for de-scheduling a first virtual processor from a logical processor upon acquiring a lock thereto. In one embodiment, the method involves identifying that the first virtual processor has acquired a lock on the logical processor. Typically, the logical processor is configured to execute a thread issued by the first virtual processor once the lock is acquired. The method may further involve inspecting the logical processor to determine a duration of a spin wait. As discussed herein, the phrase “spin wait” generally pertains to the performance of nonproductive loops while attempting to execute the thread at the logical processor. The spin-wait duration may be compared against a time threshold, where the time threshold represents a predefined number of the nonproductive loops performed consecutively by the logical processor. When the spin-wait duration exceeds the time threshold, a scheduler may be employed to de-schedule the first virtual processor from the logical processor for a predetermined time frame. Also, the scheduler may schedule a second virtual processor to the logical processor for an interim time slice.
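As a rough illustration of the back-end method, the following Python sketch compares a count of consecutive nonproductive loops against a threshold and, when the threshold is exceeded, swaps in a second virtual processor for the interim. The names `LogicalProcessor` and `check_spin_wait`, and both constants, are assumptions made for this sketch and do not come from the specification.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical values, chosen only for illustration.
SPIN_THRESHOLD_LOOPS = 1_000   # consecutive nonproductive loops tolerated
DESCHEDULE_FRAME_US = 500      # predetermined time frame before rescheduling


@dataclass
class LogicalProcessor:
    running: Optional[str] = None      # name of the virtual processor holding the lock
    consecutive_spin_loops: int = 0    # nonproductive loops observed in a row


def check_spin_wait(lp: LogicalProcessor,
                    second_vp: str,
                    now_us: int,
                    threshold: int = SPIN_THRESHOLD_LOOPS,
                    frame_us: int = DESCHEDULE_FRAME_US) -> Optional[Tuple[str, int]]:
    """De-schedule a long-spinning virtual processor.

    Returns (name of the de-scheduled virtual processor, time at which it
    becomes eligible for rescheduling), or None when no action is taken.
    """
    if lp.running is None or lp.consecutive_spin_loops <= threshold:
        return None
    descheduled = lp.running
    # Hand the logical processor to the second virtual processor for the interim.
    lp.running = second_vp
    lp.consecutive_spin_loops = 0
    return descheduled, now_us + frame_us
```

The returned timestamp stands in for the predetermined time frame: a caller could keep it in a queue and reschedule the first virtual processor once that time has passed.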

Having briefly described an overview of embodiments of the present invention, an exemplary operating environment suitable for implementing embodiments of the present invention is described below.

Referring to the drawings in general, and initially to FIG. 1 in particular, an exemplary operating environment for implementing embodiments of the present invention is shown and designated generally as computing device 100. Computing device 100 is but one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of embodiments of the present invention. Neither should the computing environment 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated.

Embodiments of the present invention may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program components, being executed by a computer or other machine, such as a personal data assistant or other handheld device. Generally, program components including routines, programs, objects, components, data structures, and the like refer to code that performs particular tasks, or implements particular abstract data types. Embodiments of the present invention may be practiced in a variety of system configurations, including handheld devices, consumer electronics, general-purpose computers, specialty computing devices, etc. Embodiments of the invention may also be practiced in distributed computing environments where tasks are performed by remote-processing devices that are linked through a communications network.

With continued reference to FIG. 1, computing device 100 includes a bus 110 that directly or indirectly couples the following devices: memory 112, one or more processors 114, one or more presentation components 116, input/output (I/O) ports 118, I/O components 120, and an illustrative power supply 122. Bus 110 represents what may be one or more busses (such as an address bus, data bus, or combination thereof). Although the various blocks of FIG. 1 are shown with lines for the sake of clarity, in reality, delineating various components is not so clear, and metaphorically, the lines would more accurately be grey and fuzzy. For example, one may consider a presentation component such as a display device to be an I/O component. Also, processors have memory. The inventors hereof recognize that such is the nature of the art and reiterate that the diagram of FIG. 1 is merely illustrative of an exemplary computing device that can be used in connection with one or more embodiments of the present invention. Distinction is not made between such categories as “workstation,” “server,” “laptop,” “handheld device,” etc., as all are contemplated within the scope of FIG. 1 and reference to “computer” or “computing device.”

Computing device 100 typically includes a variety of computer-readable media. By way of example, and not limitation, computer-readable media may comprise Random Access Memory (RAM); Read Only Memory (ROM); Electronically Erasable Programmable Read Only Memory (EEPROM); flash memory or other memory technologies; CDROM, digital versatile disks (DVDs) or other optical or holographic media; magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to encode desired information and be accessed by computing device 100.

Memory 112 includes computer storage media in the form of volatile and/or nonvolatile memory. The memory may be removable, nonremovable, or a combination thereof. Exemplary hardware devices include solid-state memory, hard drives, optical-disc drives, etc. Computing device 100 includes one or more processors that read data from various entities such as memory 112 or I/O components 120. Presentation component(s) 116 present data indications to a user or other device. Exemplary presentation components include a display device, speaker, printing component, vibrating component, etc. I/O ports 118 allow computing device 100 to be logically coupled to other devices including I/O components 120, some of which may be built-in. Illustrative components include a microphone, joystick, game pad, satellite dish, scanner, printer, wireless device, etc.

Turning now to FIG. 2, a block diagram is illustrated, in accordance with an embodiment of the present invention, showing an exemplary cloud computing platform that is configured to allocate physical machines 211, 212, and 213 within a data center 200 for use by one or more virtual machines. It will be understood and appreciated that the cloud computing platform shown in FIG. 2 is merely an example of one suitable computing system environment and is not intended to suggest any limitation as to the scope of use or functionality of embodiments of the present invention. Neither should the cloud computing platform 200 be interpreted as having any dependency or requirement related to any single component or combination of components illustrated therein. Further, although the various blocks of FIG. 2 are shown with lines for the sake of clarity, in reality, delineating various components is not so clear, and metaphorically, the lines would more accurately be grey and fuzzy.

The cloud computing platform includes the data center 200 that is comprised of interconnected physical machines 211, 212, and 213, which are configured to host and support operation of virtual machines. In particular, the physical machines 211, 212, and 213 may include one or more nodes that have logical processors for running operations, tasks, or threads issued by the virtual machines. These nodes may be partitioned within hardware of the physical machines 211, 212, and 213 in order to isolate applications or program components running thereon. However, the nodes may be connected across the hardware of a physical machine via hubs (not shown) that allow for a task, command, or thread (i.e., issued by an application or program component) being executed on a remote node to access memory at another node that is local to the application or the program component. The phrase “application,” as used herein, broadly refers to any software, service application, or portions of software, that runs on top of, or accesses storage locations within, the data center 200.

By way of example, the physical machine 211 could possibly be equipped with two individual nodes, a first node 220 and a second node 230. However, it should be understood that other configurations of the physical machine 211 are contemplated (i.e., equipped with any number of nodes). The first node 220 and the second node 230 each include separate resources in the physical machine 211, but can communicate via a hub (not shown) to access remote memory. Often this type of a communication involves consuming significant resources and, thus, is more expensive than running the processes in isolation on the respective first node 220 and second node 230. Further, the first node 220 and the second node 230 may be provisioned with physical processors. For instance, the first node 220 may be provisioned with a set of physical processors 225 that includes logical processors LP1, LP2, LP3, and LP4. Similarly, the second node 230 may include a set of physical processors 235 that includes logical processors LP5, LP6, LP7, and LP8. In this embodiment, both sets of physical processors 225 and 235 resemble multicore, or QuadCore, processors that are constructed with multiple physical cores (e.g., LP1-LP8) for processing threads in parallel. Although specific configurations of nodes are depicted, it should be appreciated and understood that threads, tasks, and commands from the virtual machines may be executed by various processing devices which are different in configuration from the specific illustrated embodiments above. For instance, any number of logical processors, working in conjunction with other resources (e.g., software and/or hardware), can be used to carry out operations assigned to the nodes 220 and 230. Therefore it is emphasized that embodiments of the present invention are not limited only to the configurations shown and described, but embrace a wide variety of computing device designs that fall within the spirit of the claims.

Typically, the logical processors LP1-LP8 represent some form of a computing unit (e.g., central processing unit, microprocessor, blades of a server, etc.) to support operations of the virtual machines running thereon. As utilized herein, the phrase “computing unit” generally refers to a dedicated computing device with processing power and storage memory, which supports one or more operating systems or other underlying software. In one instance, the computing unit is configured with tangible hardware elements, or machines, that are integral, or operably coupled, to the nodes 220 and 230, or the physical machines 211, 212, and 213, within the data center 200 to enable each device to perform a variety of processes and operations. In another instance, the computing unit may encompass a processor coupled to a computer-readable medium accommodated by the nodes 220 and 230. Generally, the computer-readable medium stores, at least temporarily, a plurality of computer software components that are executable by the processor. As utilized herein, the term “processor” is not meant to be limiting and may encompass any elements of the computing unit that act in a computational capacity. In such capacity, the processor may be configured as a tangible article that processes instructions. In an exemplary embodiment, processing may involve fetching, decoding/interpreting, executing, and writing back instructions.

Per embodiments of the present invention, the physical machines 211, 212, and 213 represent any form of computing devices, such as a personal computer, a desktop computer, a laptop computer, a mobile device, a consumer electronic device, server(s), blades in a stack, the computing device 100 of FIG. 1, and the like. In one instance, the physical machines 211, 212, and 213 host and support the operations of the virtual machines assigned thereto, while simultaneously hosting other virtual machines, or guests, created for supporting other customers of the data center 200. In operation, these guests support service applications owned by those customers.

In one aspect, the nodes 220 and 230 operate within the context of the cloud computing platform and, accordingly, communicate internally through connections dynamically made between the physical machines 211, 212, and 213, and externally through a physical network topology to other resources, such as a remote network (e.g., enterprise private network). The connections may involve interconnecting via a network cloud 280. The network cloud 280 interconnects these resources such that the node 220 may recognize a location of the node 230, and other nodes, in order to establish communication pathways therebetween. In addition, the network cloud 280 may establish this communication over channels connecting the nodes 220 and 230. By way of example, the channels may include, without limitation, one or more local area networks (LANs) and/or wide area networks (WANs). Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and the Internet. Accordingly, the network is not further described herein.

Turning now to FIG. 3, a block diagram is illustrated that shows an exemplary distributed multithread environment 300 depicting virtual machines 320 and 330 overlaid on physical machines, such as the physical machine 211, via schedulers 250 and 251, in accordance with an embodiment of the present invention. In one embodiment, the virtual machines 320 and 330 may represent the portions of software and hardware that participate in running a service application. The virtual machines 320 and 330 are typically maintained by a virtualization layer, such as the respective schedulers 250 and 251, that virtualizes hardware, such as the first node 220 and second node 230, for executing commands, tasks, and threads. In one example, the first virtual machine 320 includes a first virtualization stack 325 of virtual processors (VP1, VP2, VP3, and VP4) that is associated with the scheduler 250. In this example, the second virtual machine 330 includes a second virtualization stack 335 of virtual processors (VP5, VP6, VP7, and VP8) that is associated with the scheduler 251. In this example, the scheduler 250 is configured to schedule threads (illustrated as dashed lines), issued by the virtual processors VP1-VP4, to the logical processors provisioned within the first node 220 and the second node 230, respectively. The scheduler 251 is configured to schedule threads (illustrated as dashed lines), issued by the virtual processors VP5-VP8, to the logical processors provisioned within another instance of the physical machine 212, as discussed above with reference to FIG. 2.

By way of example, the scheduler 250 allocates time slices on the logical processors to execute threads, such that the logical processors can support a multitude of threads issued from the virtual processors VP1-VP4 plus other virtual processors (not shown) in tandem. In an exemplary embodiment, the scheduler 250 allocates time slices for VP1-VPX, where X is greater than four (i.e., hosting many virtual processors on fewer logical processors). In this situation, the virtual processors outnumber the logical processors, so there is no one-to-one correlation therebetween; thus, the scheduler 250 is configured to dynamically manage usage of logical processors to balance a changing load imposed by many virtual processors. As used herein, the phrase “time slice” is not meant to be limiting, but may encompass a share of computing resources (e.g., CPU and/or memory) that is granted to a virtual processor to execute a thread, or some other work issued by the virtual processor.

Generally, the virtual processors VP1-VPX of the virtual machine 320 are allocated on one or more logical processors to support functionality of a service application, where allocation is based on demands (e.g., amount of processing load) applied by the service application. As used herein, the phrase “virtual machine” is not meant to be limiting and may refer to any software, application, operating system, or program that is executed by a logical processor to enable the functionality of a service application running in a data center. Further, the virtual machines 320 and 330 may access processing capacity, storage locations, and other assets within the data center to properly support the service application.

In operation, the virtual processors VP1-VPX comprising the virtual machine 320 are dynamically scheduled to resources (e.g., logical processors LP1-LP4 of FIG. 2) within a physical computer system inside the data center. In a particular instance, threads issued from the virtual processors are dynamically awarded time slices on logical processors to satisfy a current processing load. In embodiments, a scheduler 250 is responsible for automatically allocating time slices on the logical processors. By way of example, the scheduler 250 may rely on a service model (e.g., designed by a customer that owns the service application) to provide guidance on how and when to allocate time slices on the logical processors.

As used herein, the term “scheduler” is not meant to be limiting, but may refer to any logic, heuristics, or rules that are responsible for scheduling the virtual processors VP1-VPX, or any other virtual processors, on available logical processors. In an exemplary embodiment, the scheduler 250 attempts to select the optimal, or best suited, logical processor to accept and execute a particular virtual processor. Upon selection, the scheduler 250 may proceed to allocate a time slice on the optimal logical processor and to place the thread thereon. These decisions (e.g., selection, allocation, and scheduling) performed by the scheduler 250 are imperative to the proper and timely performance of a service application. Further, it is advantageous to use efficient algorithms when making the decisions.

In embodiments, the schedulers 250 and/or 251 represent local schedulers that are running on each instance of a physical machine individually. As illustrated, the scheduler 250 is running on the physical machine 211, while the scheduler 251 is running on the physical machine 212. Accordingly, the schedulers 250 and 251 illustrated in FIG. 3 manage workload within a particular physical machine, where such physical machines include a scheduler (hypervisor), a single root partition, and one virtualization stack. The physical machines 211 and 212 make up a portion of the data center which is configured to host the virtual machines.

As more fully discussed below, the embodiments of the present invention relate to opportunistically scheduling threads, thereby reducing spin waits. By way of example, the scheduler 250 may include a hypervisor. In operation, the hypervisor manages CPUs and memory in the physical machine 211 and is responsible for multiplexing the logical processors onto many virtual processors. The hypervisor manages virtual processors belonging to virtual machines hosted within the data center and provides optimal performance characteristics for guests (e.g., virtual machines 320 and 330) that run on top of the logical processors.

In a particular instance, the hypervisor is charged with scheduling logical processors in a way that maintains a parity (of access to the logical processors) among the virtual processors VP1-VPX, thus promoting fairness within the system. This type of a scheduling may involve implementing a selection scheme that attempts to evenly distribute the allocated time slices of the logical processors between the virtual processors VP1-VPX, while still opportunistically granting extended time slices to particular virtual processors when certain conditions are detected (e.g., executing a critical section of code). Thus, via the methods discussed below, the hypervisor can mitigate front-end inefficiencies caused by unenlightened de-scheduling of the virtual machines and can mitigate back-end inefficiencies by preemptively de-scheduling those virtual machines having issued threads presently in a spin-wait state. In other embodiments, the hypervisor looks at a priority of a virtual processor, an amount of time awarded the virtual processor with respect to time awarded to other virtual processors, and/or other criteria when deciding how to schedule the logical processors to the virtual processors VP1-VPX.
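One simple way to picture a selection scheme that maintains parity is to always pick the virtual processor that has been awarded the least time so far, with priority consulted first. The sketch below is only an assumed illustration: the function names `pick_next_vp` and `run_slices`, the ordering of the criteria, and the tie-break by name are hypothetical and not taken from the specification.

```python
from typing import Dict, List


def pick_next_vp(awarded_us: Dict[str, int], priority: Dict[str, int]) -> str:
    """Pick the virtual processor to run next.

    Assumed ordering: higher priority first, then least time awarded so far,
    then name as a deterministic tie-break.
    """
    return min(awarded_us,
               key=lambda vp: (-priority.get(vp, 0), awarded_us[vp], vp))


def run_slices(vps: List[str], slices: int, slice_us: int = 1_000) -> Dict[str, int]:
    """Hand out a number of equal time slices and report the time each VP received."""
    awarded = {vp: 0 for vp in vps}
    for _ in range(slices):
        vp = pick_next_vp(awarded, {})
        awarded[vp] += slice_us
    return awarded
```

With equal priorities, handing out six slices among three virtual processors gives each of them two, which is the even distribution the paragraph describes. Opportunistic extensions would then show up as extra time in the awarded totals and be balanced out by later selections.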

It will be understood and appreciated that the hypervisor included within the scheduler 250 shown in FIG. 3 is merely an example of suitable logic to support the service application and is not intended to suggest any limitation as to the scope of use or functionality of embodiments of the present invention.

Each of the virtual machines 320 and 330 may be associated with one or more virtual processors configured as root partitions. Typically, one virtual machine is associated with a single root partition, as illustrated in FIG. 3. As further illustrated in FIG. 3, the first virtual machine 320 is associated with a root partition 340 while the second virtual machine 330 is associated with a root partition 350. As described herein, the root partitions 340 and 350 generally pertain to mechanisms that support the input/output activity of the other virtual processors VP1-VPX of the virtual machines 320 and 330. In this role, the root partitions 340 and 350 allow for communication between the virtual machines 320 and 330 and may be responsible for networking with many other virtual machines (e.g., via a direct access to the network) by leveraging a network card, disk, or other hardware. In one instance of operation, the root partitions 340 and 350 are configured to generate a request at local virtual machines 320 and 330, respectively, via a dedicated channel in the hardware, and to convey the request to remote virtual machines, thereby enforcing security and isolation of the virtual machines.

In an exemplary embodiment, the first node 220 and the second node 230 form a computer system that, when managed by the scheduler 250, is capable of reducing runtime of a thread being executed at one node (e.g., second node 230) that is remotely located from memory on another node (e.g., first node 220) being utilized by a virtual processor. Generally, if a virtual processor within either virtualization stack 325 or 335 occupies memory (not shown) in the first node 220, this memory is local to the logical processor of the first node 220. The presence of local memory enables efficient execution of a thread issued from the virtual processor when the thread is scheduled to the logical processors of the first node 220. In contrast, when the thread issued by the virtual processor is scheduled to a remote logical processor on the second node 230, any access to the memory on the first node 220 is inefficient because communication is conducted via a hub, which connects the first node 220 and the second node 230 across a hardware partition. This is true even in the situation where the first node 220 and the second node 230 are carved out of resources of the same physical machine 211.

Accordingly, embodiments of the present invention address this inefficiency by configuring the scheduler 250 to allocate longer time slices on local logical processors residing on the first node 220, where the virtual processor is associated with memory in the first node 220, and to allocate shorter time slices on remote logical processors residing on the second node 230. In particular implementations of this allocation scheme, the scheduler 250 is initially configured to receive an indication that a virtual processor is attempting to execute a thread. The indication may be based on the operating system detecting one of the root partitions 340 or 350 is attempting to perform input/output work on behalf of other virtual machines, thus acting in a hosting capacity. Also, this indication may be provided by the operating system installed on the physical machine 211.

Upon receiving the indication, the scheduler 250 may initially ascertain whether one or more local logical processors in the first node 220 are available, where memory that is local to the virtual processor is included in the first node 220. If it is determined that the first node 220 lacks the available resources to execute the thread, the scheduler 250 may inspect the second node 230 to ascertain its present availability. If there exists a remote logical processor in the second node 230 that can execute the thread, that remote logical processor is scheduled to execute the thread. As such, even though this remote logical processor will likely not execute the thread as efficiently as the local logical processor, the scheduler prioritizes fulfilling requests from the virtual processors in a timely manner over waiting for the most-efficient resources to become available.

However, because the remote logical processor is not as efficient as the local logical processor, the scheduler 250 may allocate a reduced time slice on the remote logical processor in the second node 230. In an exemplary embodiment, the duration of time associated with the reduced time slice is less than a duration of time associated with a pre-established time slice generally allocated on the local logical processors in the first node 220. By way of example, the reduced time slice may be associated with a duration of time lasting 100 microseconds (μs), while the pre-established time slice may be associated with a duration of time lasting 10 milliseconds (ms). In this way, the scheduler can make opportunistic time slice adjustments for threads running on remote logical processors in nonideal nodes, such as the second node 230 in this example. This technique employed by the scheduler 250 for decreasing time slices on nonideal nodes, in comparison to time slices allocated on preferred nodes, can be applied to a nonuniform memory access (NUMA) topology to improve overall system performance.

By decreasing time slices on nonideal nodes, the scheduler 250 reduces runtime of a thread being executed on the remote logical processors. But, because root partitions 340 and 350 often exhibit bursty behavior, where a compact set of tasks are requested in a sporadic fashion, the reduced runtime is still generally adequate to satisfy the needs of the root partitions 340 and 350. If the runtime is not adequate (i.e., the reduced time slice scheduled on the remote logical processor in the nonideal node elapsed), the scheduler 250 can return to the preferred node (first node 220) to check for local-logical-processor availability. Accordingly, this sampling approach provides the scheduler 250 with opportunities to optimize the scheduling of the pending threads, such that threads are attended to in a reasonable time frame while inefficient scheduling is limited.
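By way of illustration only, the node-aware allocation described above may be sketched as follows. This is a minimal sketch and not the claimed implementation: the names `LogicalProcessor`, `Allocation`, and `allocate` are hypothetical, and the two durations are simply the example values given above (10 ms on the preferred node, 100 μs on a nonideal node).

```python
from dataclasses import dataclass
from typing import List, Optional

# Example durations from the description; all identifiers here are hypothetical.
LOCAL_SLICE_US = 10_000   # pre-established time slice on the preferred node (10 ms)
REMOTE_SLICE_US = 100     # reduced time slice on a nonideal node (100 microseconds)


@dataclass
class LogicalProcessor:
    name: str
    node: int
    busy: bool = False


@dataclass
class Allocation:
    lp: LogicalProcessor
    slice_us: int


def allocate(preferred_node: int, lps: List[LogicalProcessor]) -> Optional[Allocation]:
    """Pick an idle logical processor for a thread.

    The preferred node (the one holding the virtual processor's memory) is
    checked first and receives the full time slice. If it has no idle
    logical processor, an idle remote one is used with the reduced slice.
    """
    for lp in lps:
        if lp.node == preferred_node and not lp.busy:
            return Allocation(lp, LOCAL_SLICE_US)
    for lp in lps:
        if lp.node != preferred_node and not lp.busy:
            return Allocation(lp, REMOTE_SLICE_US)
    return None  # nothing is available on either node


# Node 0 is preferred but busy, so the thread lands on node 1 with a short slice.
processors = [LogicalProcessor("LP1", node=0, busy=True), LogicalProcessor("LP2", node=1)]
first = allocate(0, processors)
print(first.lp.name, first.slice_us)
```

In this sketch, calling `allocate` again once the reduced slice elapses plays the role of the sampling step: the preferred node is re-checked before any further time is spent on the remote node.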

Turning now to FIG. 4, a block diagram is shown that illustrates an exemplary distributed multithread environment 400 where the first virtualization stack 325 of virtual processors VP1-VP4 are interacting with logical processors 225 of the physical machine 211 via the scheduler 250, in accordance with an embodiment of the present invention. It should be appreciated and understood that this interaction illustrated in FIG. 4 is exemplary and intended to explain one embodiment of operation of the scheduler 250.

Initially, a thread 405 from virtual processor VP1 is identified by the scheduler 250. Upon identification, the scheduler queries the logical processors 225 to find a logical processor that is available. In this exemplary interaction, logical processor LP1 is found to be available to execute the thread. Accordingly, the scheduler 250 allocates an initial time slice on LP1 such that LP1 can begin execution of the thread 405. As discussed below, the scheduler 250 may opportunistically create availability on LP1 by de-scheduling a thread 415 issued from virtual processor VP2. This creation of availability may be in response to detecting that the thread 415 was performing a spin wait for an extended number of consecutive cycles. As illustrated, reference numeral 410 depicts the de-scheduled thread 415 waiting in queue to be rescheduled at LP1 where the thread 415 had previously acquired a lock. The threads 425 and 435, issued from the virtual processors VP3 and VP4, respectively, are shown as residing in spin-wait states 420 and 430. As discussed above, spin waits consume resources if left to cycle for an extended amount of time. Accordingly, it is advantageous to expend a minimal amount of power to govern scheduling of the threads 405, 415, 425, and 435, as opposed to allowing spin waits to progress unchecked.

With reference to FIGS. 5-7, schematic depictions of schemes for scheduling virtual processors to physical processors, upon the virtual processors acquiring a lock thereto, are shown in accordance with embodiments of the present invention. Initially, the schemes are formatted as bar charts with up to three physical processors LP1, LP2, and LP3 represented on the y-axis while some period of time is represented on the x-axis. The hash marks on the x-axis are meant to depict a linear procession of time and not to indicate an actual duration of time slices that are allocated on the physical machines.

Referring to the schematic depiction of the scheme in FIG. 5, this scheme demonstrates issues that may occur when a virtual processor VP1 has acquired a lock, or spinlock, on two separate logical processors LP1 and LP3 in tandem. Initially, VP1 acquires a lock on LP3 and is allocated a time slice 520 to execute a critical section of code associated with a thread. Before completing execution, VP1 acquires a lock on LP1 and is allocated a time slice 510 to execute another critical section of code associated with a thread. Successfully executing this other critical section of code depends on the completed execution of the critical section assigned to LP3. However, before LP3 can complete execution of the critical section assigned thereto, VP1 is de-scheduled from LP3 and VP2 is scheduled at time slice 530. As such, the execution of the critical section on LP1 enters a spin wait until VP1 is rescheduled on LP3, at 540, and completes execution of the critical section. Upon LP3 executing the critical section of code assigned thereto, VP1 releases the lock on LP3 and LP1 is able to complete execution of the critical code in time slice 510.

As shown, the inopportune de-scheduling of VP1, while executing the critical section of code on LP3, causes the inefficient allocation of multiple resources, LP1 and LP3, to VP1 for an extended duration of time. As more fully discussed with reference to FIG. 6, if VP1 had a chance to finish the critical section on LP3, instead of being de-scheduled, VP1 would have released the lock on LP1 earlier. As a result, LP1 would have had greater availability (e.g., not as many virtual processors attempting to access LP1 during the time slice 510 would have been blocked). In practice, the scheme of FIG. 5 is a problem that is inherent to the scheduling of virtual machines and prevents proper scaling of multiprocessor virtual machines if there are no mitigations in place.

Referring to the schematic depiction of the scheme in FIG. 6, this scheme demonstrates a front-end method employed by a scheduler for prolonging scheduling of the logical processor LP1 to the virtual processor VP1. In this case, prolonging scheduling promotes conserving computing resources, such as time consumed on logical processor LP1. As discussed above with reference to FIG. 5, VP1 acquires a lock on LP3. In addition, time slice 620 is initially awarded to VP1 during which LP3 executes a critical section of code associated with a thread issued by VP1. The front-end method asks the scheduler to detect an expiration of the initial time slice 620 awarded to VP1, where the initial time slice 620 typically expires after a predetermined duration of time. The front-end method also allows the scheduler to recognize that the initial time slice 620 expired before LP3 has had the opportunity to fully execute the critical section of a thread, as discussed immediately below.

In embodiments, the scheduler may perform a determination step to ascertain whether LP3 is executing a critical section of code associated with the thread issued from VP1. In one instance, determining whether a virtual processor is executing a critical section of code involves receiving an indication, or hint, from an operating system that the virtual processor is in a critical section of code and that other virtual processors could become blocked if the virtual processor that acquired the lock were to be de-scheduled. The operating system may identify the execution of the critical section by examining a task priority register (TPR) that exposes a level of importance of the threads being executed. In this embodiment, the operating system may determine that any threads with a TPR importance level above a threshold value (e.g., value of 2) are deemed to be working in the critical section. In another embodiment, the operating system may glean that the execution of a critical section of code is occurring by inspecting an interrupt service routine. In yet another embodiment, the operating system may ascertain that a logical processor is executing a critical section of code by identifying that the logical processor is running a synchronizing region of code, which determines resources to be accessed by a virtual processor.
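The hint described above may, for illustration, be reduced to a small predicate. The following is a sketch under stated assumptions: the function name and its parameters are hypothetical, the threshold of 2 is the example value from the text, and combining the three signals with a logical OR is an assumption made here for compactness (the description presents them as separate embodiments).

```python
TPR_THRESHOLD = 2  # example threshold value from the description


def in_critical_section(tpr_level: int,
                        in_interrupt_service_routine: bool = False,
                        in_synchronizing_region: bool = False) -> bool:
    """Return True when the operating-system hint suggests a critical section.

    A TPR importance level strictly above the threshold counts as a hint,
    as does running an interrupt service routine or a synchronizing region.
    """
    return (tpr_level > TPR_THRESHOLD
            or in_interrupt_service_routine
            or in_synchronizing_region)


print(in_critical_section(3))                                     # above the threshold
print(in_critical_section(2))                                     # at the threshold only
print(in_critical_section(0, in_interrupt_service_routine=True))  # hinted by the ISR
```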

When the scheduler determines that LP3 is executing the critical section of code, the scheduler may grant VP1 a first time-slice extension 630 in order to facilitate LP3 completing the critical section before de-scheduling VP1. In an exemplary embodiment, the first time-slice extension 630 allocates LP3 to VP1 for a reduced duration of time in comparison to the predetermined duration of time associated with the initial time slice 620. By way of example, the initial time slice 620 may have a predetermined duration of 10 ms, while the time-slice extension 630 may have a reduced duration of 100 μs. By reducing the duration of the time-slice extension 630, inequities between virtual processors attempting to access a particular logical processor are diminished. However, a length of the reduced duration of time associated with the time-slice extension 630 may be adjusted based upon a priority level attached to the thread, or based on a number of virtual processors within a virtual machine that are supported by a particular logical processor.

In the instance that the scheduler determines that LP3 is not executing a critical section of code for VP1, VP1 may be de-scheduled from LP3 in order to allow other virtual processors to access LP3 and execute threads thereon. By way of example, upon de-scheduling VP1 from LP3, the scheduler may grant virtual processor VP2 time on LP3. In this example, a time slice (not shown) is awarded to VP2 that may be substantially equivalent in duration to the predetermined duration of time associated with the initial time slice 620 awarded to VP1.

Returning to the instance where the scheduler recognized LP3 is running a critical section of code and has granted VP1 the time-slice extension 630, the front-end method may invoke the scheduler to perform a determination of whether LP3 is continuing to execute the critical section of code associated with the thread from VP1. When it is determined LP3 is continuing to execute the critical section of code, the scheduler may grant a second time-slice extension 640 to VP1. As with the time-slice extension 630, the time-slice extension 640 allocates LP3 to VP1 in order to execute the thread for another reduced duration of time. In an exemplary embodiment, the reduced duration of time associated with the first time-slice extension 630 and with the second time-slice extension 640 are substantially equivalent.

Although various configurations of the time slices and time-slice extensions have been described, it should be understood and appreciated that other suitable durations of time slices and time-slice extensions that allocate a logical processor to a virtual processor may be used, and that embodiments of the present invention are not limited to those durations described herein. For instance, the durations of the time-slice extensions 630, 640, and 650 may vary (e.g., grow iteratively shorter in length), or the duration of time slices awarded to VP1 and VP2 may differ.

Upon awarding the second time-slice extension 640 to VP1, the scheduler will again review indications from the operating system to understand whether the critical section of code is still being executed at LP3. If so, the scheduler may again award a time-slice extension, such as the third time-slice extension 650, to VP1. Advantageously, the time-slice extensions 630, 640, and 650 allow LP3 to complete executing the critical section and permit LP1 to proceed with executing a thread for VP1. In this case, LP1 depends on LP3 finalizing execution of the critical section in order for LP1 to fully carry out its execution.

In embodiments, the front-end method continues until either the scheduler receives an indication that LP3 has completed execution of the critical section for VP1, or VP1 releases its lock on LP3. With respect to the latter embodiment, the scheduler may periodically inspect LP3 to determine whether the lock acquired by VP1 is being held. When the lock on LP3 is identified as being released, the front-end method calls for the scheduler to arrest its periodic inspection of LP3 and to refrain from granting VP1 an additional time-slice extension.
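For illustration, the front-end method of FIG. 6 may be sketched as a loop that grants short extensions while the critical section is still running and the lock is still held. This is a sketch only: the function name, the two callback parameters, and the cap of three extensions (mirroring extensions 630, 640, and 650) are assumptions, while the 10 ms and 100 μs durations are the example values given above.

```python
from typing import Callable, Tuple

INITIAL_SLICE_US = 10_000  # example initial time slice (10 ms)
EXTENSION_US = 100         # example time-slice extension (100 microseconds)


def run_front_end(in_critical_section: Callable[[int], bool],
                  lock_held: Callable[[int], bool],
                  max_extensions: int = 3) -> Tuple[int, int]:
    """Simulate the front-end method for one virtual processor.

    Both callbacks take the elapsed time in microseconds. The loop starts at
    the expiration of the initial slice and grants one extension at a time
    until the critical section completes, the lock is released, or the
    (hypothetical) cap is reached. Returns (total_time_us, extensions_granted).
    """
    elapsed = INITIAL_SLICE_US
    extensions = 0
    while (extensions < max_extensions
           and in_critical_section(elapsed)
           and lock_held(elapsed)):
        elapsed += EXTENSION_US
        extensions += 1
    return elapsed, extensions


# The critical section finishes 150 us after the initial slice expires,
# so two 100 us extensions are enough.
total_us, granted = run_front_end(lambda t: t < 10_150, lambda t: True)
print(total_us, granted)
```

If the first check finds that no critical section is running, the loop body never executes and the virtual processor is de-scheduled at the normal expiration of the initial slice.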

In one instance, the scheduler is configured to learn from the logical processors' interactions with the virtual processors. For instance, the scheduler, or another entity residing on the physical machine, may monitor a frequency at which time-slice extensions are granted to VP1, VP2, and any other virtual processor. Based on the monitoring, a pattern may be generated that reflects the frequency of granting the time-slice extensions. In operation, the scheduler may apply this pattern to adjust the reduced duration of time associated with the time-slice extensions or to determine a maximum number of time-slice extensions that may be awarded to a particular virtual processor.
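One way the monitoring described above might be realized is sketched below. The description does not specify how the pattern is computed or applied, so everything here is an assumption made for illustration: the class name `ExtensionMonitor`, the sliding window of recent slice expirations, and the policy of halving the extension once more than half of the recent expirations required one.

```python
from collections import deque


class ExtensionMonitor:
    """Track how often one virtual processor is granted a time-slice extension."""

    def __init__(self, window: int = 8, base_extension_us: int = 100):
        # Each entry records whether an extension was granted at a slice expiry.
        self.history = deque(maxlen=window)
        self.base_extension_us = base_extension_us

    def record(self, granted: bool) -> None:
        self.history.append(granted)

    def frequency(self) -> float:
        """Fraction of recent slice expirations that led to an extension."""
        if not self.history:
            return 0.0
        return sum(self.history) / len(self.history)

    def next_extension_us(self) -> int:
        # Hypothetical policy: a virtual processor that asks for extensions
        # most of the time receives shorter ones, limiting unfairness.
        if self.frequency() > 0.5:
            return self.base_extension_us // 2
        return self.base_extension_us


monitor = ExtensionMonitor(window=4)
for granted in (True, True, True, False):
    monitor.record(granted)
print(monitor.frequency(), monitor.next_extension_us())
```

The same frequency value could instead drive a cap on the number of extensions per virtual processor, which is the other use of the pattern mentioned above.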

Referring to the schematic depiction of the scheme in FIG. 7, this scheme demonstrates a back-end method employed by a scheduler for de-scheduling the virtual processor VP1 from the logical processor LP2 in order to conserve resources, such as time consumed on logical processor LP2. Initially, VP1 acquires a lock on LP1 and is awarded a time slice 710 to execute a first critical section of code. Then VP1 acquires a lock on LP2 and is awarded a time slice 720 to perform a second critical section of code, where the second critical section depends on LP1 finalizing execution of the first critical section at 710 before it can advance beyond a certain point. Accordingly, VP1 enters a spin wait on LP2 during the time slice 720. As discussed above, spin waits pertain to performing nonproductive loops while attempting to execute the thread at a logical processor and are often inefficient methods for holding a lock on a logical processor.

While VP1 has acquired a lock on both LP1 and LP2, the scheduler may perform the back-end method for de-scheduling VP1 from LP2 after it has acquired a lock and has entered a spin wait. Initially, the back-end method calls for the scheduler to identify that VP1 has acquired a lock on LP2, and that LP2 is executing a thread issued by VP1 upon acquiring the lock. In further accordance with the back-end method, the scheduler may inspect LP2 to determine a duration of a spin wait. This spin-wait duration may be compared against a time threshold. In one instance, the time threshold represents a predefined number of the nonproductive loops (e.g., 4095 cycles) performed consecutively by the logical processor. In another instance, the time threshold is based on a predefined, static period of time. In yet another instance, the time threshold is dynamically tuned based on recorded behavior of the virtual processors, such as the pattern explained above.

When the scheduler determines that the spin-wait duration does not meet the time threshold, it may allow the LP2 to continue attempting to execute the thread issued by VP1 at time slice 720. In contrast, when the scheduler determines that the spin-wait duration on LP2 exceeds the time threshold, VP1 is de-scheduled from LP2 for a predetermined time frame. In this way, the scheduler notices that no useful work is being performed on LP2 at the present time and allows other ready threads to be scheduled on the LP2 to improve overall system throughput.
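The threshold comparison described above may be sketched as a single decision function. The function name and parameters are hypothetical; the value 4095 is the example loop count from the description. Treating a spin wait of exactly the threshold length as not yet exceeding it is an assumption, since the description only distinguishes "does not meet" from "exceeds".

```python
SPIN_THRESHOLD_LOOPS = 4095  # example count of consecutive nonproductive loops


def should_deschedule(consecutive_spin_loops: int,
                      threshold_loops: int = SPIN_THRESHOLD_LOOPS) -> bool:
    """Return True when the spin wait has exceeded the threshold.

    True means the virtual processor would be de-scheduled so that another
    ready thread can use the logical processor; False means it keeps spinning.
    """
    return consecutive_spin_loops > threshold_loops


print(should_deschedule(1_000))  # still below the threshold
print(should_deschedule(5_000))  # past the threshold
```

A statically timed or dynamically tuned threshold would fit the same shape by passing a different `threshold_loops` value, or by substituting elapsed time for the loop count.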

In addition, the scheduler may schedule another virtual processor VP2 to LP2 for an interim time slice 730 (e.g., 100 microseconds). At some time after awarding the interim time slice 730 to VP2, the scheduler may recognize that the interim time slice 730 has elapsed. Upon elapse, the back-end method instructs the scheduler to reschedule VP1 to LP2 for a time slice 740. At some time after rescheduling VP1 to LP2, the scheduler may detect that the thread being executed at LP2 has entered a subsequent spin wait. The scheduler may again ascertain whether a duration of the subsequent spin wait exceeds the time threshold. If so, VP1 is again de-scheduled from LP2. Upon de-scheduling VP1 for a second time, the scheduler may schedule another virtual processor to LP2 for an interim time slice 750. This other virtual processor may be VP2 or a third virtual processor (not shown).

Eventually, LP1 will complete execution of the critical section being run at time slice 710, and VP1 will release the lock on LP1. At this point, the thread running on LP2 will exit a spin-wait state and commence productive execution. This is indicated at time slice 760. Time slice 760 is illustrated as extended because VP1 is not de-scheduled from LP2 when it is not in a spin-wait state. Accordingly, the scheduler allocates LP2 to VP1, when performing productive execution, until the critical section is completely executed and VP1 releases its lock on LP2.
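The back-end sequence of FIG. 7 may be illustrated with a small timeline simulation for LP2. This is a sketch under assumptions: the function name, the `spin_detect_us` parameter (the time taken for a spin wait to exceed the threshold), and the `work_us` parameter are hypothetical, and only the 100 μs interim slice comes from the example above.

```python
from typing import List, Tuple

INTERIM_SLICE_US = 100  # example interim time slice from the description

Entry = Tuple[int, int, str, str]  # (start_us, end_us, owner, activity)


def back_end_timeline(lock_release_us: int,
                      spin_detect_us: int = 50,
                      work_us: int = 300) -> List[Entry]:
    """Build the schedule of LP2 while VP1 waits on a lock held at LP1.

    VP1 spins until the threshold is exceeded, is de-scheduled for one
    interim slice given to another virtual processor, and is then
    rescheduled. Once the lock at LP1 is released, VP1 runs productively.
    """
    timeline: List[Entry] = []
    now = 0
    while now < lock_release_us:
        # VP1 spins until the threshold trips or the dependency clears.
        spin_end = min(now + spin_detect_us, lock_release_us)
        timeline.append((now, spin_end, "VP1", "spin"))
        now = spin_end
        if now >= lock_release_us:
            break
        # Threshold exceeded: another virtual processor gets the interim slice.
        timeline.append((now, now + INTERIM_SLICE_US, "other", "work"))
        now += INTERIM_SLICE_US
    # The dependency is satisfied, so VP1 is left to run without interruption.
    timeline.append((now, now + work_us, "VP1", "work"))
    return timeline


for entry in back_end_timeline(lock_release_us=320):
    print(entry)
```

With these example numbers, LP2 spends 120 μs spinning rather than the full 320 μs, and the remaining 200 μs goes to other virtual processors.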

Turning now to FIG. 8, a flow diagram is shown that articulates a front-end method 800 for prolonging allocation of a logical processor to a virtual processor, in accordance with an embodiment of the present invention. Initially, the method 800 includes the step of detecting an expiration of an initial time slice awarded to the first virtual processor that has acquired a lock on the logical processor, as indicated at block 802. Typically, the initial time slice expires after the logical processor executes a thread, issued from the first virtual processor, for a predetermined duration of time. As indicated at block 804, a determination of whether the first virtual processor is executing a critical section of code associated with the thread is performed. When the determination indicates the first virtual processor is not executing the critical section of code, the method 800 may involve de-scheduling the first virtual processor from the logical processor (see block 806), and allowing a second virtual processor to access the logical processor (see block 808).

When the determination indicates that the first virtual processor is executing the critical section of code, the method 800 may involve granting the first virtual processor a first time-slice extension, as indicated at block 810. Generally, the first time-slice extension allocates the logical processor for executing the thread for a reduced duration of time. In an exemplary embodiment, the first time-slice extension is shorter in duration than the initial time slice. The method 800 may further include periodically inspecting the logical processor to ascertain whether the critical section of code is still being executed (see block 812), and, if so, granting additional time-slice extensions (see block 814).

Turning to FIG. 9, a flow diagram is illustrated that shows an embodiment of back-end method 900 for de-scheduling a first virtual processor from a logical processor, upon acquiring a lock thereto. At some point, the method 900 involves identifying that the first virtual processor has acquired a lock on the logical processor, as indicated at block 902. Typically, the logical processor is configured to execute a thread issued by the first virtual processor upon acquiring the lock. The method 900 may further involve inspecting the logical processor to determine a duration of a spin wait, as indicated at block 904. As indicated at block 906, the spin-wait duration may be compared against a time threshold, where the time threshold may represent a predefined number of the nonproductive loops performed consecutively by the logical processor. As indicated at block 908, a determination of whether the spin-wait duration exceeds the time threshold is performed. When the spin-wait duration does not meet the time threshold, the logical processor is allowed to continue attempting to execute the thread issued by the first virtual processor. This step is indicated at block 910.

When the spin-wait duration exceeds the time threshold, a scheduler may be employed to de-schedule the first virtual processor from the logical processor for a predetermined time frame, as indicated at block 912. Also, as indicated at block 914, the scheduler may schedule a second virtual processor to the logical processor for an interim time slice. As indicated at block 916, the scheduler may recognize that the interim time slice awarded to the second virtual processor has elapsed. At this point, the first virtual processor may be rescheduled to the logical processor for a time slice consistent with the time threshold. This step is indicated at block 918.

Embodiments of the present invention have been described in relation to particular embodiments, which are intended in all respects to be illustrative rather than restrictive. Alternative embodiments will become apparent to those of ordinary skill in the art to which embodiments of the present invention pertain without departing from its scope.

From the foregoing, it will be seen that this invention is one well adapted to attain all the ends and objects set forth above, together with other advantages which are obvious and inherent to the system and method. It will be understood that certain features and sub-combinations are of utility and may be employed without reference to other features and sub-combinations. This is contemplated by and is within the scope of the claims.

What is claimed is:
1. One or more computer-storage memory having computer-executable instructions embodied thereon that, when executed, perform a front-end method for prolonging allocation of a logical processor to a first virtual processor, the method comprising: detecting using a scheduler an expiration of an initial time slice awarded to the first virtual processor that has acquired a lock on the logical processor, wherein the initial time slice expires after the logical processor executes a thread, issued from the first virtual processor, for a predetermined duration of time; ascertaining, using the scheduler, whether the first virtual processor is executing a critical section of code associated with the thread, wherein upon de-scheduling the first virtual processor, a second virtual processor becomes blocked, wherein the second virtual processor depends on complete execution of the critical section of code by the first virtual processor, wherein the critical section of code is identified based on receiving an indication of a level of importance of the thread corresponding to the critical section of code, the level of importance meeting a predefined threshold; and when the first virtual processor is executing the critical section of code that is blocking additional processing of at least the second virtual processor, granting the first virtual processor a first time-slice extension; and allocating the first time-slice extension to the logical processor for executing the thread for a reduced duration of time.
2. The one or more computer-storage memory of claim 1, wherein the method further comprises, when the first virtual processor is not executing the critical section of code: de-scheduling the first virtual processor from the logical processor; allowing the second virtual processor to access the logical processor, wherein allowing a second virtual processor to access the logical processor comprises granting the second virtual processor a subsequent time slice that allocates the logical processor to the second virtual processor for a predetermined duration of time.
3. The one or more computer-storage memory of claim 1, wherein a length of the reduced duration of time of the first time-slice extension is longer when the first virtual processor is associated with a local node and shorter when the first virtual processor is associated with a remote node.
4. The one or more computer-storage memory of claim 3, wherein the remote node is remotely located from the memory associated with the local node.
5. The one or more computer-storage memory of claim 4, further comprising, when the first virtual processor is associated with the remote node: determining the availability of one or more logical processors within the local node, incident to an elapse of the first-time extension scheduled on the first virtual processor.
6. The one or more computer-storage memory of claim 1, wherein the method further comprises: upon granting the first virtual processor the first time-slice extension, ascertaining whether the first virtual processor is continuing to execute the critical section of code associated with the thread; and when the first virtual processor is continuing to execute the critical section of code, granting the first virtual processor a second time-slice extension, wherein the second time-slice extension allocates the logical processor to execute the thread for another reduced duration of time.
7. The one or more computer-storage memory of claim 1, wherein ascertaining whether the first virtual processor is executing a critical section of code is based on an operating system determining a thread associated with the first virtual processor has a threshold task priority register value.
8. The one or more computer-storage memory of claim 1, wherein ascertaining whether the first virtual processor is executing a critical section of code is based on the logical processor is running a synchronizing region of code.
9. The one or more computer-storage memory of claim 1, the method further comprising: periodically inspecting the logical processor to determine whether the lock acquired by the first virtual processor is being held; and when the lock on the logical processor is identified as being released, arresting the periodic inspection and refraining from granting the first virtual processor an additional time-slice extension.
10. The one or more computer-storage memory of claim 1, wherein a length of the reduced duration of time associated with the first time-slice extension is based upon a priority level attached to the thread.
11. The one or more computer-storage memory of claim 1, wherein the method further comprises: monitoring a frequency at which time-slice extensions are granted to the first virtual processor; generating a pattern that reflects the frequency of granting the time-slice extensions; and applying the pattern to adjust the reduced duration of time associated with the time-slice extensions.
12. One or more computer-storage memory computer-executable instructions embodied thereon that, when executed, perform a front-end method for prolonging allocation of a logical processor to a first virtual processor, the method comprising: detecting using a scheduler an expiration of an initial time slice awarded to the first virtual processor that has acquired a lock on the logical processor, wherein the initial time slice expires after the logical processor executes a thread, issued from the first virtual processor, for a predetermined duration of time, and wherein the predetermined duration of time of the initial time slice is based on a determination whether the first virtual processor is associated with a local node or a remote node such that a longer initial time slice is allocated when the first virtual processor is associated with the local node and a shorter initial time slice is allocated when the first virtual processor is associated with the remote node; ascertaining using the scheduler whether the first virtual processor is executing a critical section of code associated with the thread; and when the first virtual processor is executing the critical section of code, (a) blocking, based on detecting the expiration of the initial time slice, a de-scheduling of a lock of the first virtual processor on the logical processor; (b) granting the first virtual processor a first time-slice extension; and (c) allocating the first time-slice extension to the logical processor for executing the thread for a reduced duration of time.